Current Issue: October - December, Volume: 2020, Issue Number: 4, Articles: 5
Background: The number of applications of deep learning algorithms in bioinformatics is increasing as they usually achieve superior performance over classical approaches, especially when bigger training datasets are available. In deep learning applications, discrete data, e.g. words or n-grams in language, or amino acids or nucleotides in bioinformatics, are generally represented as a continuous vector through an embedding matrix. Recently, learning this embedding matrix directly from the data as part of the continuous iteration of the model to optimize the target prediction (a process called "end-to-end learning") has led to state-of-the-art results in many fields. Although usage of embeddings is well described in the bioinformatics literature, the potential of end-to-end learning for single amino acids, as compared to more classical manually curated encoding strategies, has not been systematically addressed. To this end, we compared classical encoding matrices, namely one-hot, VHSE8 and BLOSUM62, to end-to-end learning of amino acid embeddings for two different prediction tasks using three widely used architectures, namely recurrent neural networks (RNN), convolutional neural networks (CNN), and the hybrid CNN-RNN.
Results: By using different deep learning architectures, we show that end-to-end learning is on par with classical encodings for embeddings of the same dimension even when limited training data is available, and might allow for a reduction in the embedding dimension without performance loss, which is critical when deploying the models to devices with limited computational capacities. We found that the embedding dimension is a major factor in controlling the model performance. Surprisingly, we observed that deep learning models are capable of learning from random vectors of appropriate dimension.
Conclusion: Our study shows that end-to-end learning is a flexible and powerful method for amino acid encoding. Further, due to the flexibility of deep learning systems, amino acid encoding schemes should be benchmarked against random vectors of the same dimension to disentangle the information content provided by the encoding scheme from the distinguishability effect provided by the scheme....
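The benchmarking idea in the conclusion above, comparing an encoding scheme against random vectors of the same dimension, can be illustrated with a minimal sketch. It is not the authors' code: the helper names (`one_hot_table`, `random_table`, `encode`), the uniform sampling range, and the example peptide are assumptions made for illustration only.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot_table():
    """Map each amino acid to a 20-dimensional one-hot vector."""
    size = len(AMINO_ACIDS)
    return {aa: [1.0 if i == j else 0.0 for j in range(size)]
            for i, aa in enumerate(AMINO_ACIDS)}


def random_table(dim, seed=0):
    """Map each amino acid to a fixed random vector of length `dim`.

    The uniform range [-1, 1] is an arbitrary choice for this sketch.
    """
    rng = random.Random(seed)
    return {aa: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for aa in AMINO_ACIDS}


def encode(sequence, table):
    """Encode a peptide as a list of per-residue vectors."""
    return [table[aa] for aa in sequence]


peptide = "ACDK"  # hypothetical example sequence
one_hot = encode(peptide, one_hot_table())
baseline = encode(peptide, random_table(dim=20))
```

Feeding `one_hot` and `baseline` to the same model would give the kind of same-dimension comparison the abstract recommends; the random table only guarantees that residues are distinguishable, not that the vectors carry biochemical information.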
Background: Recent work on CRISPR genome editing tools mainly employs deep learning techniques. However, deep learning models lack explainability and are harder to reproduce. We were motivated to build an accurate genome editing tool using sequence-based features and traditional machine learning that can compete with deep learning models.
Results: In this paper, we present CRISPRpred(SEQ), a method for sgRNA on-target activity prediction that leverages only traditional machine learning techniques and hand-crafted features extracted from sgRNA sequences. We compare the results of CRISPRpred(SEQ) with those of DeepCRISPR, the current state-of-the-art, which uses a deep learning pipeline. Despite using only traditional machine learning methods, we have been able to beat DeepCRISPR convincingly for three of the four cell lines in the benchmark dataset (2.174%, 6.905% and 8.119% improvement for the three cell lines).
Conclusion: CRISPRpred(SEQ) has been able to convincingly beat DeepCRISPR in 3 out of 4 cell lines. We believe that by exploring further, one can design better features using only the sgRNA sequences and can come up with a better method, leveraging only traditional machine learning algorithms, that can fully beat the deep learning models....
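As an illustration of what hand-crafted features extracted from an sgRNA sequence can look like, the following sketch computes k-mer counts and per-position nucleotide indicators. The abstract does not list the actual CRISPRpred(SEQ) features, so this feature set, the function names, and the example sequence are all assumptions.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def kmer_counts(seq, k):
    """Position-independent features: counts of every k-mer over ACGT."""
    counts = {"".join(p): 0 for p in product(NUCLEOTIDES, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing other symbols, e.g. N
            counts[kmer] += 1
    return counts


def position_features(seq):
    """Position-specific features: one indicator per (position, nucleotide)."""
    return {f"pos{i}_{n}": int(base == n)
            for i, base in enumerate(seq) for n in NUCLEOTIDES}


def featurize(seq, k=2):
    """Combine both feature groups into one flat dictionary."""
    features = {f"count_{kmer}": c for kmer, c in kmer_counts(seq, k).items()}
    features.update(position_features(seq))
    return features


sgrna = "GACGTTAACGGATCCGATTA"  # hypothetical 20-nt guide sequence
features = featurize(sgrna)
```

A flat dictionary like `features` can be turned into a numeric row and passed to any traditional learner (for example a support vector machine or a random forest), which keeps each input dimension interpretable.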
Background: Currently, the combination of molecular tools, imaging techniques and analysis software offers the possibility of studying gene activity through the use of fluorescent reporters and of inferring its distribution within complex biological three-dimensional structures. For example, Confocal Scanning Laser Microscopy (CSLM) is regularly used to visually inspect the spatial distribution of a fluorescent signal. Although a plethora of generalist imaging software is available to analyze experimental pictures, the development of tailor-made software for every specific problem is still the most straightforward approach to perform the best possible image analysis. In this manuscript, we focused on developing a simple methodology to satisfy one particular need: automated processing and analysis of CSLM image stacks to generate 3D fluorescence profiles showing the average distribution detected in bacterial colonies grown in different experimental conditions for comparison purposes.
Results: The presented method processes batches of CSLM stacks containing three-dimensional images of an arbitrary number of colonies. Quasi-circular colonies are identified, filtered and projected onto a normalized orthogonal coordinate system, where a numerical interpolation is performed to obtain fluorescence values within a spatially fixed grid. A statistically representative three-dimensional fluorescent pattern is then generated from this data, allowing for standardized fluorescence analysis regardless of variability in colony size. The proposed methodology was evaluated by analyzing fluorescence from GFP expression subject to regulation by a stress-inducible promoter....
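The projection and interpolation step described in the results can be sketched on a single 2D slice. This is a simplified illustration, not the published method: it assumes each colony is already described by a centre and a radius in pixel units, uses bilinear interpolation as the numerical interpolation, and averages the normalized grids cell by cell. All function names are hypothetical.

```python
def bilinear(image, x, y):
    """Bilinearly interpolate `image` (rows of floats) at real-valued (x, y)."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the image border
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def normalize_colony(image, cx, cy, radius, n=5):
    """Resample a colony onto an n x n grid spanning [-1, 1] in radius units."""
    steps = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [[bilinear(image, cx + u * radius, cy + v * radius) for u in steps]
            for v in steps]


def mean_profile(grids):
    """Average several normalized grids cell by cell."""
    n = len(grids[0])
    return [[sum(g[r][c] for g in grids) / len(grids) for c in range(n)]
            for r in range(n)]


# Two synthetic "colonies" of different size with the same relative gradient.
small = [[float(x) for x in range(5)] for _ in range(5)]
large = [[x / 2 for x in range(9)] for _ in range(9)]
grids = [normalize_colony(small, 2, 2, 2), normalize_colony(large, 4, 4, 4)]
profile = mean_profile(grids)
```

Because both colonies are resampled in units of their own radius, they land on the same fixed grid and can be averaged even though their pixel sizes differ, which is the point of the normalization.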
Background: Circular RNA (circRNA) has been extensively identified in cells and tissues, and plays crucial roles in human diseases and biological processes. circRNA could act as dynamic scaffolding molecules that modulate protein-protein interactions. The interactions between circRNA and RNA Binding Proteins (RBPs) are also deemed to be an essential element underlying the functions of circRNA. Given the cost-heavy and labor-intensive nature of biological experiments, high-throughput experimental data have instead enabled the large-scale prediction and analysis of circRNA-RBP interactions.
Results: A computational framework is constructed by employing Positive Unlabeled learning (P-U learning) to predict unknown circRNA-RBP interaction pairs with the kernel model MFNN (Matrix Factorization with Neural Networks). The neural network is employed to extract the latent factors of circRNA and RBP in the interaction matrix, and the P-U learning strategy is applied to alleviate the imbalanced characteristics of data samples and predict unknown interaction pairs. For this purpose, the known circRNA-RBP interaction data samples are collected from the circRNAs in cancer cell lines database (CircRic), and the circRNA-RBP interaction matrix is constructed as the input of the model. The experimental results show that kernel MFNN outperforms the other deep kernel models. Interestingly, it is found that a deeper stack of hidden layers in the neural network does not necessarily improve our model. Finally, the unlabeled interactions are scored using P-U learning with the MFNN kernel, and the predicted interaction pairs are matched against a database of known interactions. The results indicate that our method is an effective model to analyze the circRNA-RBP interactions.
Conclusion: For poorly studied circRNA-RBP interactions, we design a prediction framework based only on the interaction matrix by employing matrix factorization and a neural network. We demonstrate that MFNN achieves higher prediction accuracy, and it is an effective method....
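To make the idea of scoring unlabeled pairs from an interaction matrix concrete, here is a minimal sketch. It deliberately swaps in plain weighted matrix factorization trained by stochastic gradient descent instead of the MFNN neural network, and it approximates the P-U setting by giving unlabeled entries a smaller loss weight than known positives. The toy matrix, hyperparameters, and function names are assumptions.

```python
import random


def train_mf(matrix, rank=2, unlabeled_weight=0.2, lr=0.05, reg=0.01,
             epochs=500, seed=0):
    """Factorize a 0/1 interaction matrix into row factors P and column factors Q.

    Entries equal to 1 are known positives (weight 1.0); entries equal to 0
    are unlabeled and only weakly pulled toward 0 (weight `unlabeled_weight`).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for i in range(n_rows):
            for j in range(n_cols):
                target = matrix[i][j]
                weight = 1.0 if target == 1 else unlabeled_weight
                err = target - sum(P[i][k] * Q[j][k] for k in range(rank))
                for k in range(rank):
                    p, q = P[i][k], Q[j][k]
                    P[i][k] += lr * (weight * err * q - reg * p)
                    Q[j][k] += lr * (weight * err * p - reg * q)
    return P, Q


def score(P, Q, i, j):
    """Predicted interaction score for row item i and column item j."""
    return sum(p * q for p, q in zip(P[i], Q[j]))


# Toy matrix: rows stand for circRNAs, columns for RBPs (hypothetical data).
interactions = [[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 1, 0]]
P, Q = train_mf(interactions)
ranked = sorted(((score(P, Q, i, j), i, j)
                 for i in range(4) for j in range(4)
                 if interactions[i][j] == 0), reverse=True)
```

The list `ranked` orders the unlabeled pairs from most to least likely under this toy model; in the workflow described above, the top-scoring pairs would then be checked against known interactions.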
Background: Recently, it has become possible to collect next-generation DNA sequencing data sets that are composed of multiple samples from multiple biological units, where each of these samples may be from a single cell or bulk tissue. However, no tool yet exists for simulating DNA sequencing data from such a nested sampling arrangement with single-cell and bulk samples so that developers of analysis methods can assess accuracy and precision.
Results: We have developed a tool that simulates DNA sequencing data from hierarchically grouped (correlated) samples where each sample is designated bulk or single-cell. Our tool uses a simple configuration file to define the experimental arrangement and can be integrated into software pipelines for testing of variant callers or other genomic tools.
Conclusions: The DNA sequencing data generated by our simulator is representative of real data and integrates seamlessly with standard downstream analysis tools....